Abstract

An information-theoretic development is given for the problem of compound Poisson
approximation, which parallels earlier treatments for Gaussian and Poisson approximation.
Let $P_{S_n}$ be the distribution of a sum $S_n = \sum_{i=1}^n Y_i$ of independent
integer-valued random variables $Y_i$. Nonasymptotic bounds are derived for the distance
between $P_{S_n}$ and an appropriately chosen compound Poisson law. In the case
where all $Y_i$ have the same conditional distribution given $\{Y_i \neq 0\}$, a bound on the
relative entropy distance between $P_{S_n}$ and the compound Poisson distribution is derived,
based on the data-processing property of relative entropy and earlier Poisson approximation
results. When the $Y_i$ have arbitrary distributions, corresponding bounds are
derived in terms of the total variation distance. The main technical ingredient is the
introduction of two "information functionals," and the analysis of their properties.
These information functionals play a role analogous to that of the classical Fisher
information in normal approximation. Detailed comparisons are made between the
resulting inequalities and related bounds.
1 Introduction and main results

The study of the distribution of a sum $S_n = \sum_{i=1}^n Y_i$ of weakly dependent random
variables $Y_i$ is an important part of probability theory, with numerous classical and modern
applications. This work provides an information-theoretic treatment of the problem of
approximating the distribution of $S_n$ by a compound Poisson law, when the $Y_i$ are
discrete, independent random variables. Before describing the present approach, some of the
relevant background is briefly reviewed.



1.1 Normal approximation and entropy 

When $Y_1, Y_2, \ldots, Y_n$ are independent and identically distributed (i.i.d.) random variables
with mean zero and variance $\sigma^2 < \infty$, the central limit theorem (CLT) and its various
refinements state that the distribution of $T_n := (1/\sqrt{n}) \sum_{i=1}^n Y_i$ is close to the
$N(0,\sigma^2)$ distribution for large $n$. In recent years the CLT has been examined from an
information-theoretic point of view and, among various results, it has been shown that, if the
$Y_i$ have a density with respect to Lebesgue measure, then the density $f_{T_n}$ of the normalized
sum $T_n$ converges monotonically to the normal density with mean zero and variance $\sigma^2$;
that is, the entropy $h(f_{T_n}) := -\int f_{T_n} \log f_{T_n}$ of $f_{T_n}$ increases to the
$N(0,\sigma^2)$ entropy as $n \to \infty$, which is maximal among all random variables with fixed
variance $\sigma^2$. [Throughout, 'log' denotes the natural logarithm.]

Apart from this intuitively appealing result, information-theoretic ideas and techniques
have also provided nonasymptotic inequalities, for example giving accurate bounds on the
relative entropy $D(f_{T_n}\|\phi) := \int f_{T_n} \log(f_{T_n}/\phi)$ between the density of $T_n$
and the limiting normal density $\phi$. Details can be found in [8, 17, 15, 3, 2, 31, 26] and the
references in these works.

The gist of the information-theoretic approach is based on estimates of the Fisher
information, which acts as a "local" version of the relative entropy. For a random variable $Y$
with a differentiable density $f$ and variance $\sigma^2 < \infty$, the (standardized) Fisher
information is defined as,

$$J_N(Y) := E\left[\left(\frac{\partial}{\partial y}\log f(Y) - \frac{\partial}{\partial y}\log\phi(Y)\right)^2\right],$$

where $\phi$ is the $N(0,\sigma^2)$ density. The functional $J_N$ satisfies the following properties:

(A) $J_N(Y)$ is the variance of the (standardized) score function,
$r_Y(y) := \frac{\partial}{\partial y}\log f(y) - \frac{\partial}{\partial y}\log\phi(y)$, $y \in \mathbb{R}$.

(B) $J_N(Y) = 0$ if and only if $Y$ is Gaussian.

(C) $J_N$ satisfies a subadditivity property for sums.

(D) If $J_N(Y)$ is small then the density $f$ of $Y$ is approximately normal and, in particular,
$D(f\|\phi)$ is also appropriately small.

Roughly speaking, the information-theoretic approach to the CLT and associated normal
approximation bounds consists of two steps: first a strong version of Property (C) is used
to show that $J_N(T_n)$ is close to zero for large $n$, and then Property (D) is applied to obtain
precise bounds on the relative entropy $D(f_{T_n}\|\phi)$.



1.2 Poisson approximation 

More recently, an analogous program was carried out for Poisson approximation. The
Poisson law was identified as having maximum entropy within a natural class of discrete
distributions on $\mathbb{Z}_+ := \{0,1,2,\ldots\}$ [14, 30, 16], and Poisson approximation bounds in
terms of relative entropy were developed in [21]; see also [19] for earlier related results.
The approach of [21] follows a similar outline to the one described above for normal
approximation. Specifically, for a random variable $Y$ with values in $\mathbb{Z}_+$ and distribution
$P$, the scaled Fisher information of $Y$ was defined as,

$$J_\pi(Y) := \lambda E[\rho_Y(Y)^2] = \lambda \mathrm{Var}(\rho_Y(Y)), \tag{1.1}$$

where $\lambda$ is the mean of $Y$ and the scaled score function $\rho_Y$ is given by,

$$\rho_Y(y) := \frac{(y+1)P(y+1)}{\lambda P(y)} - 1, \quad y \geq 0. \tag{1.2}$$

[Throughout, we use the term 'distribution' to refer to the discrete probability mass
function of an integer-valued random variable.]

As discussed briefly before the proof of Theorem 1.1 in Section 2, the functional $J_\pi(Y)$
was shown in [21] to satisfy Properties (A-D) exactly analogous to those of the Fisher
information described above, with the Poisson law playing the role of the Gaussian
distribution. These properties were employed to establish optimal or near-optimal Poisson
approximation bounds for the distribution of sums of nonnegative integer-valued random
variables [21].



1.3 Compound Poisson approximation 

This work provides a parallel treatment for the more general - and technically significantly
more difficult - problem of approximating the distribution $P_{S_n}$ of a sum $S_n = \sum_{i=1}^n Y_i$
of independent $\mathbb{Z}_+$-valued random variables by an appropriate compound Poisson law.
This and related questions arise naturally in applications involving counting; see, e.g.,
[7, 1, 4, 12]. As we will see, in this setting the information-theoretic approach not only
gives an elegant alternative route to the classical asymptotic results (as was the case in
the first information-theoretic treatments of the CLT), but it actually yields fairly sharp
finite-$n$ inequalities that are competitive with some of the best existing bounds.

Given a distribution $Q$ on $\mathbb{N} = \{1,2,\ldots\}$ and a $\lambda > 0$, recall that the compound
Poisson law $\mathrm{CPo}(\lambda, Q)$ is defined as the distribution of the random sum $\sum_{i=1}^{Z} X_i$, where
$Z \sim \mathrm{Po}(\lambda)$ is Poisson distributed with parameter $\lambda$ and the $X_i$ are i.i.d. with distribution
$Q$, independent of $Z$.

Relevant results that can be seen as the intellectual background to the information-theoretic
approach for compound Poisson approximation were recently established in [18, 33],
where it was shown that, like the Gaussian and the Poisson, the compound Poisson
law has a maximum entropy property within a natural class of probability measures on $\mathbb{Z}_+$.
Here we provide nonasymptotic, computable and accurate bounds for the distance between
$P_{S_n}$ and an appropriately chosen compound Poisson law, partly based on extensions of the
information-theoretic techniques introduced in [21] and [19] for Poisson approximation.

In order to state our main results we need to introduce some more terminology. When
considering the distribution of $S_n = \sum_{i=1}^n Y_i$, we find it convenient to write each $Y_i$ as
the product $B_iX_i$ of two independent random variables, where $B_i$ takes values in $\{0, 1\}$
and $X_i$ takes values in $\mathbb{N}$. This is done uniquely and without loss of generality, by taking
$B_i$ to be $\mathrm{Bern}(p_i)$ with $p_i = \Pr\{Y_i \neq 0\}$, and $X_i$ having distribution $Q_i$ on $\mathbb{N}$, where
$Q_i(k) = \Pr\{Y_i = k \mid Y_i \geq 1\} = \Pr\{Y_i = k\}/p_i$, for $k \geq 1$.

In the special case of a sum $S_n = \sum_{i=1}^n Y_i$ of random variables $Y_i = B_iX_i$ where all
the $X_i$ have the same distribution $Q$, it turns out that the problem of approximating $P_{S_n}$
by a compound Poisson law can be reduced to a Poisson approximation inequality. This
is achieved by an application of the so-called "data-processing" property of the relative
entropy, which then facilitates the use of a Poisson approximation bound established in
[21]. The result is stated in Theorem 1.1 below; its proof is given in Section 2.



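Theorem 1.1. Consider a sum $S_n = \sum_{i=1}^n Y_i$ of independent random variables $Y_i = B_iX_i$,
where the $X_i$ are i.i.d. $\sim Q$ and the $B_i$ are independent $\mathrm{Bern}(p_i)$. Then the relative entropy
between the distribution $P_{S_n}$ of $S_n$ and the $\mathrm{CPo}(\lambda, Q)$ distribution satisfies,

$$D(P_{S_n}\|\mathrm{CPo}(\lambda,Q)) \leq \frac{1}{\lambda}\sum_{i=1}^n \frac{p_i^3}{1-p_i},$$

where $\lambda := \sum_{i=1}^n p_i$.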

Recall that, for distributions $P$ and $Q$ on $\mathbb{Z}_+$, the relative entropy, or Kullback-Leibler
divergence, $D(P\|Q)$, is defined by,

$$D(P\|Q) := \sum_{x \in \mathbb{Z}_+} P(x)\log\frac{P(x)}{Q(x)}.$$

Although not a metric, relative entropy is an important measure of closeness between
probability distributions [10, 11] and it can be used to obtain total variation bounds via
Pinsker's inequality [11],

$$d_{\mathrm{TV}}(P,Q)^2 \leq \frac{1}{2}D(P\|Q),$$

where, as usual, the total variation distance is

$$d_{\mathrm{TV}}(P,Q) := \frac{1}{2}\sum_{x \in \mathbb{Z}_+} |P(x) - Q(x)| = \max_{A \subset \mathbb{Z}_+} |P(A) - Q(A)|.$$
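
As a small numerical illustration of the two distances just defined, the following Python sketch evaluates the relative entropy and the total variation distance for two pmfs on a common finite set and checks Pinsker's inequality. The function names and the example pmfs are hypothetical choices made only for this illustration.

```python
import math

def relative_entropy(p, q):
    """D(P||Q) in nats for pmfs given as equal-length lists.

    Returns infinity if P puts mass where Q does not.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total

def total_variation(p, q):
    """d_TV(P,Q) = (1/2) * sum_x |P(x) - Q(x)|."""
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

# Hypothetical example pmfs on {0, 1, 2}.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

d_kl = relative_entropy(P, Q)
d_tv = total_variation(P, Q)
# Pinsker's inequality: d_TV^2 <= D / 2.
pinsker_holds = d_tv ** 2 <= 0.5 * d_kl
```

Here the total variation distance is 0.1, and its square is smaller than half the relative entropy, as Pinsker's inequality requires.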
In the general case where the distributions $Q_i$ corresponding to the $X_i$ in the summands
$Y_i = B_iX_i$ are not identical, the data-processing argument used in the proof of Theorem 1.1
can no longer be applied. Instead, the key idea in this work is the introduction of two
"information functionals," or simply "informations," which, in the present context, play
a role analogous to that of the Fisher information $J_N$ and the scaled Fisher information
$J_\pi$ in Gaussian and Poisson approximation, respectively.

In Section 3 we will define two such information functionals, $J_{Q,1}$ and $J_{Q,2}$, and use
them to derive compound Poisson approximation bounds. Both $J_{Q,1}$ and $J_{Q,2}$ will be seen
to satisfy natural analogs of Properties (A-D) stated above, except that only a weaker
version of Property (D) will be established: when either $J_{Q,1}(Y)$ or $J_{Q,2}(Y)$ is close to
zero, the distribution of $Y$ is close to a compound Poisson law in the sense of total variation
rather than relative entropy. As in normal and Poisson approximation, combining the
analogs of Properties (C) and (D) satisfied by the two new information functionals yields
new compound Poisson approximation bounds.

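Theorem 1.2. Consider a sum $S_n = \sum_{i=1}^n Y_i$ of independent random variables $Y_i = B_iX_i$,
where each $X_i$ has distribution $Q_i$ on $\mathbb{N}$ with mean $q_i$, and each $B_i \sim \mathrm{Bern}(p_i)$. Let
$\lambda = \sum_{i=1}^n p_i$ and $Q = \sum_{i=1}^n \frac{p_i}{\lambda} Q_i$. Then,

$$d_{\mathrm{TV}}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \leq H(\lambda,Q)\, q \left\{ \left( \sum_{i=1}^n \frac{p_i^3}{1-p_i} \right)^{1/2} + D(\mathbf{Q}) \right\},$$

where $P_{S_n}$ is the distribution of $S_n$, $q = \sum_{i=1}^n \frac{p_i}{\lambda} q_i$, $H(\lambda,Q)$ denotes the Stein
factor defined in (1.4) below, and $D(\mathbf{Q})$ is a measure of the dissimilarity of the distributions
$\mathbf{Q} = (Q_i)$, which vanishes when the $Q_i$ are identical:

$$D(\mathbf{Q}) := \sum_{j=1}^{\infty} \sum_{i=1}^{n} \frac{j p_i}{q}\, |Q_i(j) - Q(j)|. \tag{1.3}$$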
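
The quantities entering Theorem 1.2 are simple to compute when the $Q_i$ have finite support. The Python sketch below forms the mixture $Q$, its mean $q$, and the dissimilarity measure, assuming that (1.3) reads $D(\mathbf{Q}) = \sum_j \sum_i (j p_i/q)|Q_i(j) - Q(j)|$. The function name and the list-based representation (entry `j-1` of each list holds $Q_i(j)$) are assumptions made for this illustration.

```python
def mixture_and_dissimilarity(p, Qs):
    """Return (lam, Q, q, D) for Bernoulli parameters p and pmfs Qs on {1,...,m}.

    p[i] is p_i; Qs[i][j-1] is Q_i(j). All Q_i are assumed to be lists
    of the same length m (pad with zeros if necessary).
    """
    lam = sum(p)
    m = len(Qs[0])
    # Mixture Q = sum_i (p_i / lam) Q_i.
    Q = [sum(pi * Qi[j] for pi, Qi in zip(p, Qs)) / lam for j in range(m)]
    # Mean q of the mixture Q.
    q = sum((j + 1) * Q[j] for j in range(m))
    # D(Q) = sum_j sum_i (j p_i / q) |Q_i(j) - Q(j)|, as assumed for (1.3).
    D = sum((j + 1) * pi * abs(Qi[j] - Q[j]) / q
            for j in range(m) for pi, Qi in zip(p, Qs))
    return lam, Q, q, D

# Two summands with different Q_i: point masses at 1 and at 2.
lam, Q, q, D = mixture_and_dissimilarity([0.1, 0.1], [[1.0, 0.0], [0.0, 1.0]])

# Identical Q_i: the dissimilarity vanishes.
_, _, _, D_same = mixture_and_dissimilarity([0.1, 0.3], [[0.5, 0.5], [0.5, 0.5]])
```

In the first example the mixture is uniform on $\{1,2\}$ with mean 1.5, and the dissimilarity equals 0.2; in the second it is zero, in line with the remark that $D(\mathbf{Q})$ vanishes when the $Q_i$ are identical.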

Theorem 1.2 is an immediate consequence of the subadditivity property of $J_{Q,1}$
established in Corollary 4.2, combined with the total variation bound in Proposition 5.3. The
latter bound states that, when $J_{Q,1}(Y)$ is small, the total variation distance between the
distribution of $Y$ and a compound Poisson law is also appropriately small. As explained
in Section 5, the proof of Proposition 5.3 uses a basic result that comes up in the proof
of compound Poisson inequalities via Stein's method, namely, a bound on the sup-norm
of the solution of the Stein equation. This explains the appearance of the Stein factor,
defined next. But we emphasize that, apart from this point of contact, the overall
methodology used in establishing the results in Theorems 1.2 and 1.4 is entirely different from
that used in proving compound Poisson approximation bounds via Stein's method.

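Definition 1.3. Let $Q$ be a distribution on $\mathbb{N}$. If $\{jQ(j)\}$ is a non-increasing sequence,
set $\delta = [\lambda\{Q(1) - 2Q(2)\}]^{-1}$ and let,

$$H_0(\lambda,Q) = \begin{cases} 1 & \text{if } \delta \geq 1 \\ \sqrt{\delta}\,(2 - \sqrt{\delta}) & \text{if } \delta < 1. \end{cases}$$

For general $Q$ and any $\lambda > 0$, the Stein factor $H(\lambda, Q)$ is defined as:

$$H(\lambda,Q) = \begin{cases} H_0(\lambda,Q), & \text{if } \{jQ(j)\} \text{ is non-increasing} \\ e^{\lambda}\min\left\{1, \frac{1}{\lambda Q(1)}\right\}, & \text{otherwise.} \end{cases} \tag{1.4}$$

Note that in the case when all the $Q_i$ are identical, Theorem 1.2 yields,

$$d_{\mathrm{TV}}(P_{S_n}, \mathrm{CPo}(\lambda,Q))^2 \leq H(\lambda,Q)^2\, q^2 \sum_{i=1}^n \frac{p_i^3}{1-p_i}, \tag{1.5}$$

where $q$ is the common mean of the $Q_i = Q$, whereas Theorem 1.1 combined with Pinsker's
inequality yields a similar, though not generally comparable, bound,

$$d_{\mathrm{TV}}(P_{S_n}, \mathrm{CPo}(\lambda,Q))^2 \leq \frac{1}{2\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i}. \tag{1.6}$$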

See Section 6 for detailed comparisons in special cases. 
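
To get a feel for how the two bounds (1.5) and (1.6) compare, the following Python sketch evaluates both for identical $Q_i = Q$ with finite support. It assumes the Stein factor of Definition 1.3 takes the form $H_0$ when $\{jQ(j)\}$ is non-increasing and $e^{\lambda}\min\{1, 1/(\lambda Q(1))\}$ otherwise; the latter expression is an assumption taken from the standard Stein-method bound. The function names are hypothetical.

```python
import math

def stein_factor(lam, Q):
    """Stein factor H(lam, Q) in the assumed form of Definition 1.3.

    Q[j-1] holds Q(j); Q is taken to be zero beyond len(Q).
    """
    jQ = [(j + 1) * qj for j, qj in enumerate(Q)]
    non_increasing = all(jQ[k] >= jQ[k + 1] for k in range(len(jQ) - 1))
    if non_increasing:
        Q2 = Q[1] if len(Q) > 1 else 0.0
        gap = lam * (Q[0] - 2.0 * Q2)      # gap = 1/delta
        if gap <= 1.0:                     # delta >= 1 (including delta infinite)
            return 1.0
        root = math.sqrt(1.0 / gap)        # sqrt(delta)
        return root * (2.0 - root)
    # Assumed general-case factor: e^lam * min{1, 1/(lam Q(1))}.
    if Q[0] == 0:
        return math.exp(lam)
    return math.exp(lam) * min(1.0, 1.0 / (lam * Q[0]))

def tv_bounds_identical_Q(ps, Q):
    """Return the total variation bounds implied by (1.5) and (1.6)."""
    lam = sum(ps)
    q = sum((j + 1) * qj for j, qj in enumerate(Q))
    s = sum(p ** 3 / (1.0 - p) for p in ps)
    bound_15 = stein_factor(lam, Q) * q * math.sqrt(s)
    bound_16 = math.sqrt(s / (2.0 * lam))
    return bound_15, bound_16

# Hypothetical example: 100 summands with p_i = 0.1 and Q the point mass at 1.
b15, b16 = tv_bounds_identical_Q([0.1] * 100, [1.0])
```

In this particular example (1.6) gives the smaller value, but, as noted above, the two bounds are not generally comparable.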

The third and last main result, Theorem 1.4, gives an analogous bound to that of
Theorem 1.2, with only a single term in the right-hand side. It is obtained from the
subadditivity property of the second information functional $J_{Q,2}$, Proposition 4.3, combined
with the corresponding total variation bound in Proposition 5.1.

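Theorem 1.4. Consider a sum $S_n = \sum_{i=1}^n Y_i$ of independent random variables $Y_i = B_iX_i$,
where each $X_i$ has distribution $Q_i$ on $\mathbb{N}$ with mean $q_i$, and each $B_i \sim \mathrm{Bern}(p_i)$. Assume
all $Q_i$ have full support on $\mathbb{N}$, and let $\lambda = \sum_{i=1}^n p_i$, $Q = \sum_{i=1}^n \frac{p_i}{\lambda} Q_i$, and $P_{S_n}$ denote
the distribution of $S_n$. Then,

$$d_{\mathrm{TV}}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \leq H(\lambda,Q) \left\{ \sum_{i=1}^n p_i^3 \sum_{y} Q_i(y)\, y^2 \left( \frac{Q_i^{*2}(y)}{2Q_i(y)} - 1 \right)^2 \right\}^{1/2},$$

where $Q_i^{*2}$ denotes the convolution $Q_i * Q_i$ and $H(\lambda, Q)$ denotes the Stein factor defined
in (1.4) above.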

The accuracy of the bounds in the three theorems above is examined in specific
examples in Section 6, where the resulting estimates are compared with what are probably
the sharpest known bounds for compound Poisson approximation. Although the main
conclusion of these comparisons - namely, that in broad terms our bounds are competitive
with some of the best existing bounds and, in certain cases, may even be the sharpest -
is certainly encouraging, we wish to emphasize that the main objective of this work is the
development of an elegant conceptual framework for compound Poisson limit theorems via
information-theoretic ideas, akin to the remarkable information-theoretic framework that
has emerged for the central limit theorem and Poisson approximation.

The rest of the paper is organized as follows. Section 2 contains basic facts, definitions
and notation that will remain in effect throughout. It also contains a brief review of earlier
Poisson approximation results in terms of relative entropy, and the proof of Theorem 1.1.
Section 3 introduces the two new information functionals: the size-biased information
$J_{Q,1}$, generalizing the scaled Fisher information of [21], and the Katti-Panjer information
$J_{Q,2}$, generalizing a related functional introduced by Johnstone and MacGibbon in [19]. It
is shown that, in each case, Properties (A) and (B) analogous to those stated in Section 1.1
for Fisher information hold for $J_{Q,1}$ and $J_{Q,2}$. In Section 4 we consider Property (C) and
show that both $J_{Q,1}$ and $J_{Q,2}$ satisfy natural subadditivity properties on convolution.
Section 5 contains bounds analogous to Property (D) above, showing that both
$J_{Q,1}(Y)$ and $J_{Q,2}(Y)$ dominate the total variation distance between the distribution of $Y$
and a compound Poisson law.

2 Size-biasing, compounding and relative entropy 

In this section we collect preliminary definitions and notation that will be used in subse- 
quent sections, and we provide the proof of Theorem 1.1. 

The compounding operation in the definition of the compound Poisson law in the
Introduction can be more generally phrased as follows. [Throughout, the empty sum
$\sum_{i=1}^{0}[\cdots]$ is taken to be equal to zero.]

Definition 2.1. For any $\mathbb{Z}_+$-valued random variable $Y \sim R$ and any distribution $Q$ on
$\mathbb{N}$, the compound distribution $C_QR$ is that of the sum,

$$\sum_{i=1}^{Y} X_i,$$

where the $X_i$ are i.i.d. with common distribution $Q$, independent of $Y$.

For example, the compound Poisson law $\mathrm{CPo}(\lambda, Q)$ is simply $C_Q\mathrm{Po}(\lambda)$, and the
compound binomial distribution $C_Q\mathrm{Bin}(n,p)$ is that of the sum $S_n = \sum_{i=1}^n B_iX_i$ where the
$B_i$ are i.i.d. $\mathrm{Bern}(p)$ and the $X_i$ are i.i.d. with distribution $Q$, independent of the $B_i$.
More generally, if the $B_i$ are Bernoulli with different parameters $p_i$, we say that $S_n$ is a
compound Bernoulli sum since the distribution of each summand $B_iX_i$ is $C_Q\mathrm{Bern}(p_i)$.

Next we recall the size-biasing operation, which is intimately related to the Poisson
law. For any distribution $P$ on $\mathbb{Z}_+$ with mean $\lambda$, the (reduced) size-biased distribution
$P^\#$ is,

$$P^\#(y) := \frac{(y+1)P(y+1)}{\lambda}, \quad y \geq 0.$$

Recalling that a distribution $P$ on $\mathbb{Z}_+$ satisfies the recursion,

$$(k+1)P(k+1) = \lambda P(k), \quad k \in \mathbb{Z}_+, \tag{2.1}$$

if and only if $P = \mathrm{Po}(\lambda)$, it is immediate that $P = \mathrm{Po}(\lambda)$ if and only if $P = P^\#$. This also
explains, in part, the definition (1.1) of the scaled Fisher information in [21]. Similarly,
the Katti-Panjer recursion states that $P$ is the $\mathrm{CPo}(\lambda, Q)$ law if and only if,

$$kP(k) = \lambda \sum_{j=1}^{k} jQ(j)P(k-j), \quad k \in \mathbb{Z}_+; \tag{2.2}$$

see the discussion in [18] for historical remarks on the origin of (2.2).
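
The Katti-Panjer recursion (2.2) also gives a practical way to tabulate the $\mathrm{CPo}(\lambda, Q)$ mass function. The Python sketch below does this for a $Q$ with finite support, starting from $P(0) = \Pr\{Z = 0\} = e^{-\lambda}$; the function name and the list-based representation of $Q$ are assumptions made for this illustration.

```python
import math

def compound_poisson_pmf(lam, Q, n_max):
    """Tabulate the CPo(lam, Q) pmf on {0, ..., n_max} via recursion (2.2).

    Q[j-1] holds Q(j); Q is taken to be zero beyond len(Q).
    """
    P = [math.exp(-lam)]  # P(0): no summands when the Poisson count is zero
    for k in range(1, n_max + 1):
        # k P(k) = lam * sum_{j=1}^{k} j Q(j) P(k - j)
        s = sum(j * Q[j - 1] * P[k - j] for j in range(1, min(k, len(Q)) + 1))
        P.append(lam * s / k)
    return P

# With Q the point mass at 1, the recursion reduces to (2.1) and gives Po(lam).
poisson_check = compound_poisson_pmf(2.0, [1.0], 10)

# A hypothetical two-point Q.
cp = compound_poisson_pmf(1.0, [0.5, 0.5], 50)
```

For the two-point example, $P(1) = 0.5\,e^{-1}$ and $P(2) = 0.625\,e^{-1}$, which agrees with a direct enumeration over the number of summands.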

Before giving the proof of Theorem 1.1 we recall two results related to Poisson
approximation bounds from [21]. First, for any random variable $X \sim P$ on $\mathbb{Z}_+$ with mean $\lambda$, a
modified log-Sobolev inequality of [9] was used in [21, Proposition 2] to show that,

$$D(P\|\mathrm{Po}(\lambda)) \leq J_\pi(X), \tag{2.3}$$

as long as $P$ has either full support or finite support. Combining this with the subadditivity
property of $J_\pi$ and elementary computations yields [21, Theorem 1], which states: if $T_n$ is
the sum of $n$ independent $B_i \sim \mathrm{Bern}(p_i)$ random variables, then,

$$D(P_{T_n}\|\mathrm{Po}(\lambda)) \leq \frac{1}{\lambda}\sum_{i=1}^n \frac{p_i^3}{1-p_i}, \tag{2.4}$$

where $P_{T_n}$ denotes the distribution of $T_n$ and $\lambda = \sum_{i=1}^n p_i$.
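Proof of Theorem 1.1. Let $Z_n \sim \mathrm{Po}(\lambda)$ and $T_n = \sum_{i=1}^n B_i$. Then the distribution of $S_n$ is
also that of the sum $\sum_{i=1}^{T_n} X_i$; similarly, the $\mathrm{CPo}(\lambda,Q)$ law is the distribution of the sum
$Z = \sum_{i=1}^{Z_n} X_i$. Thus, writing $\mathbf{X} = (X_i)$, we can express $S_n = f(\mathbf{X}, T_n)$ and $Z = f(\mathbf{X}, Z_n)$,
where the function $f$ is the same in both places. Applying the data-processing inequality
and then the chain rule for relative entropy [11],

$$\begin{aligned}
D(P_{S_n}\|\mathrm{CPo}(\lambda,Q)) &\leq D(P_{\mathbf{X},T_n}\|P_{\mathbf{X},Z_n}) \\
&= \Big[\sum_i D(P_{X_i}\|P_{X_i})\Big] + D(P_{T_n}\|P_{Z_n}) \\
&= D(P_{T_n}\|\mathrm{Po}(\lambda)),
\end{aligned}$$

and the result follows from the Poisson approximation bound (2.4). □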
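
The inequality of Theorem 1.1 can be checked numerically for small examples. The Python sketch below computes the exact pmf of a compound Bernoulli sum by convolution, tabulates the $\mathrm{CPo}(\lambda, Q)$ pmf through the recursion (2.2), and compares the relative entropy with the right-hand side $\frac{1}{\lambda}\sum_i p_i^3/(1-p_i)$. The parameter values and function names are hypothetical, and $Q$ is assumed to have finite support with $Q(1) > 0$ so that every point in the support of $S_n$ has positive compound Poisson mass.

```python
import math

def convolve(a, b):
    """Convolution of two pmfs on {0, 1, 2, ...} given as lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def compound_bernoulli_sum_pmf(ps, Q):
    """pmf of S_n = sum_i B_i X_i with B_i ~ Bern(p_i), X_i i.i.d. ~ Q (Q[j-1] = Q(j))."""
    pmf = [1.0]
    for p in ps:
        summand = [1.0 - p] + [p * qj for qj in Q]  # pmf of B_i X_i
        pmf = convolve(pmf, summand)
    return pmf

def compound_poisson_pmf(lam, Q, n_max):
    """CPo(lam, Q) pmf on {0, ..., n_max} via k P(k) = lam sum_j j Q(j) P(k-j)."""
    P = [math.exp(-lam)]
    for k in range(1, n_max + 1):
        s = sum(j * Q[j - 1] * P[k - j] for j in range(1, min(k, len(Q)) + 1))
        P.append(lam * s / k)
    return P

ps = [0.1, 0.2, 0.05]          # hypothetical Bernoulli parameters
Q = [0.5, 0.3, 0.2]            # hypothetical common distribution on {1, 2, 3}
lam = sum(ps)

P_S = compound_bernoulli_sum_pmf(ps, Q)
CP = compound_poisson_pmf(lam, Q, len(P_S) - 1)

# D(P_S || CPo) only involves the (finite) support of S_n.
divergence = sum(x * math.log(x / y) for x, y in zip(P_S, CP) if x > 0)
bound = sum(p ** 3 / (1.0 - p) for p in ps) / lam
```

Since $S_n$ has finite support, the relative entropy is a finite sum and no truncation error is involved beyond floating-point rounding.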
3 Information functionals



This section contains the definitions of two new information functionals for discrete random 
variables, along with some of their basic properties. 

3.1 Size-biased information 

For the first information functional we consider, some knowledge of the summation struc- 
ture of the random variables concerned is required. 

Definition 3.1. Consider the sum $S = \sum_{i=1}^n Y_i \sim P$ of $n$ independent $\mathbb{Z}_+$-valued random
variables $Y_i \sim P_i = C_{Q_i}R_i$, $i = 1,2,\ldots,n$. For each $j$, let $Y_j' \sim C_{Q_j}(R_j^\#)$ be independent
of the $Y_i$, and let $S^{(j)} \sim P^{(j)}$ be the same sum as $S$ but with $Y_j'$ in place of $Y_j$.

Let $q_i$ denote the mean of each $Q_i$, $p_i = E(Y_i)/q_i$ and $\lambda = \sum_i p_i$. Then the size-biased
information of $S$ relative to the sequence $\mathbf{Q} = (Q_i)$ is,

$$J_{Q,1}(S) := \lambda E[r_1(S;P,Q)^2],$$

where the score function $r_1$ is defined by,

$$r_1(s;P,Q) := \frac{\sum_i p_i P^{(i)}(s)}{\lambda P(s)} - 1, \quad s \in \mathbb{Z}_+.$$

For simplicity, in the case of a single summand $S = Y_1 \sim P_1 = C_QR$ we write $r_1(\cdot; P, Q)$
and $J_{Q,1}(Y)$ for the score and the size-biased information of $S$, respectively. [Note that
the score function $r_1$ is only infinite at points $x$ outside the support of $P$, which do not
affect the definition of the size-biased information functional.]

Although at first sight the definition of $J_{Q,1}$ seems restricted to the case when all
the summands $Y_i$ have distributions of the form $C_{Q_i}R_i$, we note that this can always be
achieved by taking $p_i = \Pr\{Y_i \geq 1\}$ and letting $R_i \sim \mathrm{Bern}(p_i)$ and
$Q_i(k) = \Pr\{Y_i = k \mid Y_i \geq 1\}$, for $k \geq 1$, as before.

We collect below some of the basic properties of $J_{Q,1}$ that follow easily from the
definition.

1. Since $E[r_1(S;P,Q)] = 0$, the functional $J_{Q,1}(S)$ is in fact the variance of the score
$r_1(S;P,Q)$.

2. In the case of a single summand $S = Y_1 \sim C_QR$, if $Q$ is the point mass at 1 then
the score $r_1$ reduces to the score function $\rho_Y$ in (1.2). Thus $J_{Q,1}$ can be seen as a
generalization of the scaled Fisher information $J_\pi$ of [21] defined in (1.1).

3. Again in the case of a single summand $S = Y_1 \sim C_QR$, we have that $r_1(s; P, Q) \equiv 0$
if and only if $R^\# = R$, i.e., if and only if $R$ is the $\mathrm{Po}(\lambda)$ distribution. Thus in this
case $J_{Q,1}(S) = 0$ if and only if $S \sim \mathrm{CPo}(\lambda, Q)$ for some $\lambda > 0$.
4. In general, writing $F^{(i)}$ for the distribution of the leave-one-out sum $\sum_{j \neq i} Y_j$,

$$r_1(\cdot;P,Q) \equiv 0 \iff \sum_i p_i\, F^{(i)} * \left(C_{Q_i}R_i - C_{Q_i}R_i^\#\right) \equiv 0.$$

Hence within the class of ultra log-concave $R_i$ (a class which includes compound
Bernoulli sums), since the moments of $R_i$ are no smaller than the moments of $R_i^\#$
with equality if and only if $R_i$ is Poisson, the score $r_1(\cdot; P, Q) \equiv 0$ if and only if the
$R_i$ are all Poisson, i.e., if and only if $P$ is compound Poisson.



3.2 Katti-Panjer information 

Recall that the recursion (2.1) characterizing the Poisson distribution was used as part of 
the motivation for the definition of the scaled Fisher information in (1.1) and (1.2). In 
an analogous manner, we employ the Katti-Panjer recursion (2.2) that characterizes the 
compound Poisson law to define another information functional. 

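Definition 3.2. Given a $\mathbb{Z}_+$-valued random variable $Y \sim P$ and an arbitrary distribution
$Q$ on $\mathbb{N}$, the Katti-Panjer information of $Y$ relative to $Q$ is defined as,

$$J_{Q,2}(Y) := E[r_2(Y;P,Q)^2],$$

where the score function $r_2$ is,

$$r_2(y;P,Q) := \frac{\lambda \sum_{j=1}^{\infty} jQ(j)P(y-j)}{P(y)} - y, \quad y \in \mathbb{Z}_+,$$

and where $\lambda$ is the ratio of the mean of $Y$ to the mean of $Q$.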

From the definition of the score function $r_2$ it is immediate that,

$$\begin{aligned}
E[r_2(Y;P,Q)] &= \sum_{y} P(y)\, r_2(y;P,Q) \\
&= \lambda \sum_{y: P(y) > 0} \sum_{j} jQ(j)P(y-j) - E(Y) \\
&= \lambda \sum_{j} jQ(j) - E(Y) = 0,
\end{aligned}$$

therefore $J_{Q,2}(Y)$ is equal to the variance of $r_2(Y; P, Q)$. [This computation assumes that
$P$ has full support on $\mathbb{Z}_+$; see the last paragraph of this section for further discussion of
this point.] Also, in view of the Katti-Panjer recursion (2.2) we have that $J_{Q,2}(Y) = 0$ if
and only if $r_2(y; P, Q)$ vanishes for all $y$, which happens if and only if the distribution $P$
of $Y$ is $\mathrm{CPo}(\lambda,Q)$.
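
For a single summand $Y = BX$ with $B \sim \mathrm{Bern}(p)$ and $X \sim Q$, substituting $P(0) = 1-p$ and $P(y) = pQ(y)$, $y \geq 1$, into the definition of $r_2$ (with $\lambda = p$) gives $r_2(0;P,Q) = 0$ and $r_2(y;P,Q) = p\,y\,(Q^{*2}(y)/(2Q(y)) - 1)$ for $y \geq 1$, where $Q^{*2} = Q * Q$. The Python sketch below checks this identity numerically for a geometric $Q$, evaluated only on a finite range of $y$; the parameter values and function names are hypothetical.

```python
def katti_panjer_score(P, Q, lam, y):
    """r_2(y; P, Q) = lam * sum_{j=1}^{y} j Q(j) P(y-j) / P(y) - y, for P(y) > 0.

    P[y] holds P(y) for y >= 0 and Q[j-1] holds Q(j); terms with j > y vanish
    because P is supported on the nonnegative integers.
    """
    s = sum(j * Q[j - 1] * P[y - j] for j in range(1, y + 1))
    return lam * s / P[y] - y

def convolution_square(Q, y):
    """Q^{*2}(y) = sum_{j=1}^{y-1} Q(j) Q(y-j)."""
    return sum(Q[j - 1] * Q[y - j - 1] for j in range(1, y))

a, p, N = 0.4, 0.2, 30
# Geometric Q(j) = (1-a) a^{j-1}, tabulated for j = 1..N (enough to evaluate y <= N exactly).
Q = [(1 - a) * a ** (j - 1) for j in range(1, N + 1)]
# pmf of Y = B X on {0, ..., N}.
P = [1 - p] + [p * qj for qj in Q]
lam = p  # ratio of the mean of Y (p * q) to the mean of Q (q)

scores = [katti_panjer_score(P, Q, lam, y) for y in range(N + 1)]
closed_form = [0.0] + [
    p * y * (convolution_square(Q, y) / (2 * Q[y - 1]) - 1) for y in range(1, N + 1)
]
```

Squaring the closed form and averaging over $Y$ gives $J_{Q,2}(Y) = p^3 \sum_y Q(y)\,y^2\,(Q^{*2}(y)/(2Q(y)) - 1)^2$ for such a summand.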

In the special case when $Q$ is the unit mass at 1, the Katti-Panjer information of $Y \sim P$
reduces to,

$$J_{Q,2}(Y) = E\left[\left(\frac{\lambda P(Y-1)}{P(Y)} - Y\right)^2\right] = \lambda^2 I(Y) + (\sigma^2 - 2\lambda), \tag{3.1}$$

where $\lambda$, $\sigma^2$ are the mean and variance of $Y$, respectively, and $I(Y)$ denotes the functional,

$$I(Y) := E\left[\left(\frac{P(Y-1)}{P(Y)} - 1\right)^2\right], \tag{3.2}$$

proposed by Johnstone and MacGibbon [19] as a discrete version of the Fisher information
(with the convention $P(-1) = 0$). Therefore, in view of (3.1) we can think of $J_{Q,2}(Y)$ as
a generalization of the "Fisher information" functional $I(Y)$ of [19].
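
The identity (3.1) can be checked numerically for a distribution with full support. The Python sketch below does so for a geometric law $P(y) = (1-a)a^y$, truncated far enough out that the neglected tail is negligible; for this law both sides work out to $a^2/(1-a)^2$. The truncation level and parameter value are hypothetical choices for this illustration.

```python
a = 0.3
N = 200  # truncation level; the neglected tail mass a**N is negligible

# Geometric pmf on {0, 1, ..., N-1}.
P = [(1 - a) * a ** y for y in range(N)]

def ratio(y):
    """P(y-1)/P(y), with the convention P(-1) = 0."""
    return P[y - 1] / P[y] if y >= 1 else 0.0

lam = sum(y * P[y] for y in range(N))                 # mean of Y
var = sum((y - lam) ** 2 * P[y] for y in range(N))    # variance of Y

# Johnstone-MacGibbon functional I(Y) = E[(P(Y-1)/P(Y) - 1)^2].
I_Y = sum(P[y] * (ratio(y) - 1.0) ** 2 for y in range(N))

# Katti-Panjer information with Q the unit mass at 1.
J_Q2 = sum(P[y] * (lam * ratio(y) - y) ** 2 for y in range(N))

rhs = lam ** 2 * I_Y + (var - 2.0 * lam)
```

Both `J_Q2` and `rhs` agree with the closed form $a^2/(1-a)^2$ up to rounding.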

Finally note that, although the definition of $J_{Q,2}$ is more straightforward than that
of $J_{Q,1}$, the Katti-Panjer information suffers the drawback that - like its simpler version
$I(Y)$ in [19] - it is only finite for random variables $Y$ with full support on $\mathbb{Z}_+$. As noted in
[20] and [21], the definition of $I(Y)$ cannot simply be extended to all $\mathbb{Z}_+$-valued random
variables by just ignoring the points outside the support of $P$, where the integrand in
(3.2) becomes infinite. This was, partly, the motivation for the definition of the scaled
Fisher information $J_\pi$ in [21]. Similarly, in the present setting, the important properties of
$J_{Q,2}$ established in the following sections fail unless $P$ has full support, unlike for the
size-biased information $J_{Q,1}$.

4 Subadditivity 

The subadditivity property of Fisher information (Property (C) in the Introduction) plays
a key role in the information-theoretic analysis of normal approximation bounds. The
corresponding property for the scaled Fisher information (Proposition 3 of [21]) plays an
analogous role in the case of Poisson approximation. Both of these results are based on
a convolution identity for each of the two underlying score functions. In this section we
develop natural analogs of the convolution identities and resulting subadditivity properties
for the functionals $J_{Q,1}$ and $J_{Q,2}$.

4.1 Subadditivity of the size-biased information 

The proposition below gives the natural analog of Property (C) in the Introduction, for
the information functional $J_{Q,1}$. It generalizes the convolution lemma and Proposition 3
of [21].

Proposition 4.1. Consider the sum $S_n = \sum_{i=1}^n Y_i \sim P$ of $n$ independent $\mathbb{Z}_+$-valued
random variables $Y_i \sim P_i = C_{Q_i}R_i$, $i = 1,2,\ldots,n$. For each $i$, let $q_i$ denote the mean of
$Q_i$, $p_i = E(Y_i)/q_i$ and $\lambda = \sum_i p_i$. Then,

$$r_1(s;P,Q) = E\left[\, \sum_{i=1}^n \frac{p_i}{\lambda}\, r_1(Y_i;P_i,Q_i) \,\Big|\, S_n = s \right], \tag{4.1}$$

and hence,

$$J_{Q,1}(S_n) \leq \sum_{i=1}^n \frac{p_i}{\lambda}\, J_{Q_i,1}(Y_i). \tag{4.2}$$






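Proof. In the notation of Definition 3.1 and the subsequent discussion, writing $F^{(i)} =
P_1 * \cdots * P_{i-1} * P_{i+1} * \cdots * P_n$, so that $P^{(i)} = F^{(i)} * C_{Q_i}R_i^\#$, the right-hand side of the
projection identity (4.1) equals,

$$\begin{aligned}
\sum_{i=1}^n \sum_{x} \frac{P_i(x)F^{(i)}(s-x)}{P(s)} \left( \frac{p_i}{\lambda} \left( \frac{C_{Q_i}R_i^\#(x)}{P_i(x)} - 1 \right) \right)
&= \frac{1}{\lambda P(s)} \sum_{i=1}^n p_i \sum_{x} C_{Q_i}R_i^\#(x)F^{(i)}(s-x) - 1 \\
&= \frac{1}{\lambda P(s)} \sum_{i=1}^n p_i P^{(i)}(s) - 1 \\
&= r_1(s;P,Q),
\end{aligned}$$

as required. The subadditivity result follows using the conditional Jensen inequality,
exactly as in the proof of Proposition 3 of [21]. □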

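Corollary 4.2. Under the assumptions of Proposition 4.1, if each $Y_i = B_iX_i$, where
$B_i \sim \mathrm{Bern}(p_i)$ and $X_i \sim Q_i$, where $p_i = \Pr\{Y_i \neq 0\}$ and $Q_i(k) = \Pr\{Y_i = k \mid Y_i \geq 1\}$, then,

$$J_{Q,1}(S_n) \leq \frac{1}{\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i},$$

where $\lambda = \sum_i p_i$.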

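Proof. Consider $Y = BX$, where $B \sim R = \mathrm{Bern}(p)$ and $X \sim Q$. Since $R^\# = \delta_0$ then
$C_Q(R^\#) = \delta_0$. Further, $Y$ takes the value 0 with probability $(1-p)$ and the value $X$ with
probability $p$. Thus,

$$r_1(x; C_QR, Q) = \frac{C_Q(R^\#)(x)}{C_QR(x)} - 1 = \frac{\delta_0(x)}{(1-p)\delta_0(x) + pQ(x)} - 1
= \begin{cases} \frac{p}{1-p} & \text{for } x = 0 \\ -1 & \text{for } x > 0. \end{cases}$$

Consequently,

$$J_{Q,1}(Y) = \frac{p^2}{1-p}, \tag{4.3}$$

and using Proposition 4.1 yields,

$$J_{Q,1}(S_n) \leq \sum_{i=1}^n \frac{p_i}{\lambda}\, J_{Q_i,1}(Y_i) = \frac{1}{\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i},$$

as claimed. □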
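
The closed form (4.3) can be confirmed directly from the definition of the size-biased information for a single summand $Y = BX$, where $\lambda = p$ and $C_Q(R^\#) = \delta_0$. The Python sketch below does this for a finite-support $Q$ and also evaluates the right-hand side of Corollary 4.2; the function names and example values are hypothetical.

```python
def size_biased_information_single(p, Q):
    """J_{Q,1}(Y) for Y = B X, with B ~ Bern(p) and X ~ Q (Q[j-1] = Q(j)).

    Uses J = lam * E[r_1(Y)^2] with lam = p and r_1(y) = delta_0(y)/P(y) - 1,
    summing only over the support of P.
    """
    P = [1.0 - p] + [p * qj for qj in Q]        # pmf of Y = C_Q Bern(p)
    size_biased = [1.0] + [0.0] * len(Q)        # C_Q(R^#) = delta_0
    lam = p
    return lam * sum(py * (sb / py - 1.0) ** 2
                     for py, sb in zip(P, size_biased) if py > 0)

def corollary_4_2_bound(ps):
    """Right-hand side (1/lam) * sum_i p_i^3 / (1 - p_i) of Corollary 4.2."""
    lam = sum(ps)
    return sum(p ** 3 / (1.0 - p) for p in ps) / lam

J_single = size_biased_information_single(0.2, [0.5, 0.3, 0.2])
bound = corollary_4_2_bound([0.1, 0.2, 0.05])
```

For $p = 0.2$ the computed value is $0.05 = p^2/(1-p)$, independently of the choice of $Q$, as (4.3) predicts.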






4.2 Subadditivity of the Katti-Panjer information 

When $S_n$ is supported on the whole of $\mathbb{Z}_+$, the score $r_2$ satisfies a convolution identity and
the functional $J_{Q,2}$ is subadditive. The following Proposition contains the analogs of (4.1)
and (4.2) in Proposition 4.1 for the Katti-Panjer information $J_{Q,2}(Y)$. These can also
be viewed as generalizations of the corresponding results for the Johnstone-MacGibbon
functional $I(Y)$ established in [19].

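Proposition 4.3. Consider a sum $S_n = \sum_{i=1}^n Y_i$ of independent random variables $Y_i =
B_iX_i$, where each $X_i$ has distribution $Q_i$ on $\mathbb{N}$ with mean $q_i$, and each $B_i \sim \mathrm{Bern}(p_i)$. Let
$\lambda = \sum_{i=1}^n p_i$ and $Q = \sum_{i=1}^n \frac{p_i}{\lambda} Q_i$. If each $Y_i$ is supported on the whole of $\mathbb{Z}_+$, then,

$$r_2(s;S_n,Q) = E\left[\, \sum_{i=1}^n r_2(Y_i;P_i,Q_i) \,\Big|\, S_n = s \right],$$

and hence,

$$J_{Q,2}(S_n) \leq \sum_{i=1}^n J_{Q_i,2}(Y_i).$$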



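Proof. Write $r_{2,i}(\cdot)$ for $r_2(\cdot; P_i,Q_i)$, and note that $E(Y_i) = p_iq_i$, for each $i$. Therefore,
$E(S_n) = \sum_i p_iq_i$, which equals $\lambda$ times the mean of $Q$. As before, let $F^{(i)}$ denote the
distribution of the leave-one-out sum $\sum_{j \neq i} Y_j$, and decompose the distribution $P_{S_n}$ of $S_n$
as $P_{S_n}(s) = \sum_x P_i(x)F^{(i)}(s-x)$. We have,

$$\begin{aligned}
r_2(s;S_n,Q) &= \frac{\lambda \sum_{j=1}^{\infty} jQ(j)P_{S_n}(s-j)}{P_{S_n}(s)} - s \\
&= \sum_{i=1}^n \frac{p_i \sum_{j=1}^{\infty} jQ_i(j)P_{S_n}(s-j)}{P_{S_n}(s)} - s \\
&= \sum_{i=1}^n \sum_{x} \frac{p_i \sum_{j=1}^{\infty} jQ_i(j)P_i(x-j)F^{(i)}(s-x)}{P_{S_n}(s)} - s \\
&= \sum_{i=1}^n \sum_{x} \frac{P_i(x)F^{(i)}(s-x)}{P_{S_n}(s)} \left[ \frac{p_i \sum_{j=1}^{\infty} jQ_i(j)P_i(x-j)}{P_i(x)} - x \right] \\
&= E\left[\, \sum_{i=1}^n r_{2,i}(Y_i) \,\Big|\, S_n = s \right],
\end{aligned}$$

proving the projection identity. Using the conditional Jensen inequality, and noting that
the cross-terms vanish because $E[r_2(X;P,Q)] = 0$ for any $X \sim P$ with full support (cf.
the discussion in Section 3.2), the subadditivity result follows, as claimed. □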
5 Information functionals dominate total variation



In the case of Poisson approximation, the modified log-Sobolev inequality (2.3) directly
relates the relative entropy to the scaled Fisher information $J_\pi$. However, the known
(modified) log-Sobolev inequalities for compound Poisson distributions [32, 22] only relate
the relative entropy to functionals different from $J_{Q,1}$ or $J_{Q,2}$. Instead of developing
subadditivity results for those other functionals, we build, in part, on some of the ideas
from Stein's method and prove relationships between the total variation distance and the
information functionals $J_{Q,1}$ and $J_{Q,2}$. (Note, however, that Lemma 5.4 does offer a partial
result showing that the relative entropy can be bounded in terms of $J_{Q,1}$.)

To illustrate the connection between these two information functionals and Stein's
method, we find it simpler to first examine the Katti-Panjer information. Recall that, for
an arbitrary function $h : \mathbb{Z}_+ \to \mathbb{R}$, a function $g : \mathbb{Z}_+ \to \mathbb{R}$ satisfies the Stein equation for
the compound Poisson measure $\mathrm{CPo}(\lambda, Q)$ if,

$$\lambda \sum_{j=1}^{\infty} jQ(j)\,g(k+j) = k\,g(k) + \big[h(k) - E[h(Z)]\big], \quad g(0) = 0, \tag{5.1}$$

where $Z \sim \mathrm{CPo}(\lambda,Q)$. [See, e.g., [13] for details as well as a general review of Stein's
method for Poisson and compound Poisson approximation.] Letting $h = I_A$ for some
$A \subset \mathbb{Z}_+$, writing $g_A$ for the corresponding solution of the Stein equation, and taking
expectations with respect to an arbitrary random variable $Y \sim P$ on $\mathbb{Z}_+$,

$$P(A) - \Pr\{Z \in A\} = E\left[ \lambda \sum_{j=1}^{\infty} jQ(j)\,g_A(Y+j) - Y g_A(Y) \right].$$

Then taking absolute values and maximizing over all $A \subset \mathbb{Z}_+$,

$$d_{\mathrm{TV}}(P, \mathrm{CPo}(\lambda,Q)) \leq \sup_{A \subset \mathbb{Z}_+} \left| E\left[ \lambda \sum_{j=1}^{\infty} jQ(j)\,g_A(Y+j) - Y g_A(Y) \right] \right|. \tag{5.2}$$

Noting that the expression in the expectation above is reminiscent of the Katti-Panjer
recursion (2.2), it is perhaps not surprising that this bound relates directly to the
Katti-Panjer information functional:

Proposition 5.1. For any random variable $Y \sim P$ on $\mathbb{Z}_+$, any distribution $Q$ on $\mathbb{N}$ and
any $\lambda > 0$,

$$d_{\mathrm{TV}}(P, \mathrm{CPo}(\lambda,Q)) \leq H(\lambda,Q)\sqrt{J_{Q,2}(Y)},$$

where $H(\lambda,Q)$ is the Stein factor defined in (1.4).

Proof. We assume without loss of generality that $Y$ is supported on the whole of $\mathbb{Z}_+$ since,
otherwise, $J_{Q,2}(Y) = \infty$ and the result is trivial. Continuing from the inequality in (5.2),

$$\begin{aligned}
d_{\mathrm{TV}}(P, \mathrm{CPo}(\lambda,Q)) &\leq \sup_{A \subset \mathbb{Z}_+} \|g_A\|_\infty \sum_{y=0}^{\infty} \left| \lambda \sum_{j=1}^{\infty} jQ(j)P(y-j) - yP(y) \right| \\
&\leq H(\lambda,Q) \sum_{y=0}^{\infty} P(y)\,|r_2(y;P,Q)| \\
&\leq H(\lambda,Q) \sqrt{\sum_{y=0}^{\infty} P(y)\,r_2(y;P,Q)^2},
\end{aligned}$$

where the first inequality follows from rearranging the first sum, the second inequality
follows from Lemma 5.2 below, and the last step is simply the Cauchy-Schwarz inequality. □

The following uniform bound on the sup-norm of the solution to the Stein equation
(5.1) is the only auxiliary result we require from Stein's method. See [5] or [13] for a proof.

Lemma 5.2. If $g_A$ is the solution to the Stein equation (5.1) for $h = I_A$, with $A \subset \mathbb{Z}_+$,
then $\|g_A\|_\infty \leq H(\lambda,Q)$, where $H(\lambda,Q)$ is the Stein factor defined in (1.4).



5.1 Size-biased information dominates total variation 

Next we establish an analogous bound to that of Proposition 5.1 for the size-biased
information $J_{Q,1}$. As this functional is not as directly related to the Katti-Panjer recursion
(2.2) and the Stein equation (5.1), the proof is technically more involved.

Proposition 5.3. Consider a sum $S = \sum_{i=1}^n Y_i \sim P$ of independent random variables
$Y_i = B_iX_i$, where each $X_i$ has distribution $Q_i$ on $\mathbb{N}$ with mean $q_i$, and each $B_i \sim \mathrm{Bern}(p_i)$.
Let $\lambda = \sum_{i=1}^n p_i$ and $Q = \sum_{i=1}^n \frac{p_i}{\lambda} Q_i$. Then,

$$d_{\mathrm{TV}}(P, \mathrm{CPo}(\lambda, Q)) \leq H(\lambda, Q)\, q \left( \sqrt{\lambda J_{Q,1}(S)} + D(\mathbf{Q}) \right),$$

where $H(\lambda,Q)$ is the Stein factor defined in (1.4), $q = (1/\lambda)\sum_i p_iq_i$ is the mean
of $Q$, and $D(\mathbf{Q})$ is the measure of the dissimilarity between the distributions $\mathbf{Q} = (Q_i)$,
defined in (1.3).

Proof. For each i, let T^*) ~ F^^^ denote the leave-one-out sum X^^yj Yi, and note that, as 
in the proof of Corollary 4.2, the distribution F^^^ is the same as the distribution P^*-* of 
the modified sum in Definition 3.1. Since Yi is nonzero with probability pi, we have, 
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for each i, 

oo oo 

j=l s=0 

oo oo 

j=l s=0 

where, for A C arbitrary, denotes the solution of the Stein equation (5.1) with 
h = Ia- Hence, summing over i and substituting in the expression in the right-hand-side 
of equation (5.2) with S in place of Y, yields. 



E[xY,jQij)9A{S + j)}-E[SgA{S)] 



CO OO 



OO OO / 

Y.Y.9A{s + j) XjQ{j)P{s) - Y.P^jQ^{j)P^\ 
s=0 j=l V i 



oo oo 



s=Oj=l \ i / 

0000/ \ 

+Y.T.aA{s+j)[Y,Pij{Q{j)-Qmp^''{s)\ 

s=0 i=l \ i / 

00 00 / \ 

+ Y.Y.9A{s + J)[Y,P^J{Q{j)-Qi{3))P^'^{s)\. (5.3) 

s=0 j=l \ i / 

By the Cauchy-Schwarz inequality, the first term in (5.3) is bounded in absolute value by

$$\lambda \sqrt{\sum_{j,s} g_A(s+j)^2\, jQ(j) P(s)}\; \sqrt{\sum_{j,s} jQ(j) P(s) \Big( \frac{\sum_i p_i P^{(i)}(s)}{\lambda P(s)} - 1 \Big)^2},$$

and for the second term, simply bound $\|g_A\|_\infty$ by $H(\lambda,Q)$ using Lemma 5.2, deducing a
bound in absolute value of

$$H(\lambda,Q) \sum_{i,j} p_i\, j\, |Q(j) - Q_i(j)|.$$

Combining these two bounds with the expression in (5.3) and the original total-variation
inequality (5.2) completes the proof, upon substituting the uniform sup-norm
bound given in Lemma 5.2. □
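
The identity behind the proof above, namely that $E[S g(S)] = \sum_i \sum_{j,s} p_i\, jQ_i(j) P^{(i)}(s)\, g(s+j)$ for any bounded $g$, with $P^{(i)}$ the law of the leave-one-out sum, can be checked numerically. The following Python sketch does so on a small example; the parameter values, the helper names `pmf_of_Y` and `convolve_all`, and the random test function are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pmf_of_Y(p, Q):
    """pmf of Y = B*X with B ~ Bern(p) and X ~ Q on {1, ..., len(Q)}."""
    return np.concatenate(([1.0 - p], p * np.asarray(Q, dtype=float)))

def convolve_all(pmfs):
    """pmf of a sum of independent variables, by repeated convolution."""
    out = np.array([1.0])
    for f in pmfs:
        out = np.convolve(out, f)
    return out

# Hypothetical example: three summands with different p_i and Q_i on {1, 2, 3}.
ps = [0.1, 0.3, 0.2]
Qs = [[0.5, 0.3, 0.2], [0.7, 0.2, 0.1], [0.2, 0.2, 0.6]]
pmfs = [pmf_of_Y(p, Q) for p, Q in zip(ps, Qs)]
P = convolve_all(pmfs)                        # law of S on {0, ..., 9}

rng = np.random.default_rng(0)
g = rng.uniform(-1.0, 1.0, size=len(P) + 5)   # arbitrary bounded test function

# Left-hand side: E[S g(S)].
lhs = sum(s * g[s] * P[s] for s in range(len(P)))

# Right-hand side: sum over i, j, s of p_i * j * Q_i(j) * P^(i)(s) * g(s + j).
rhs = 0.0
for i, (p, Q) in enumerate(zip(ps, Qs)):
    P_i = convolve_all(pmfs[:i] + pmfs[i + 1:])   # leave-one-out law P^(i)
    for j, qj in enumerate(Q, start=1):
        for s, w in enumerate(P_i):
            rhs += p * j * qj * w * g[s + j]

print(lhs, rhs)
```

The two printed values should agree up to floating-point error, whatever test function is drawn.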
Finally, recall from the discussion in the beginning of this section that the scaled Fisher
information $J_\pi$ satisfies a modified log-Sobolev inequality (2.3), which gives a bound for
the relative entropy in terms of the functional $J_\pi$. For the information functionals $J_{Q,1}$
and $J_{Q,2}$ considered in this work, we instead established analogous bounds in terms of
total variation. However, the following partial result holds for $J_{Q,1}$:

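Lemma 5.4. Let $Y = BX \sim P$, where $B \sim \mathrm{Bern}(p)$ and $X \sim Q$ on $\mathbb{N}$. Then:

$$D(P \,\|\, \mathrm{CPo}(p,Q)) \le J_{Q,1}(Y).$$

Proof. Recall from (4.3) that $J_{Q,1}(Y) = \frac{p^2}{1-p}$. Further, since the $\mathrm{CPo}(p,Q)$ mass function
at $s$ is at least $e^{-p} p Q(s)$ for $s \ge 1$, we have $D(C_Q\mathrm{Bern}(p) \,\|\, \mathrm{CPo}(p,Q)) \le (1-p)\log(1-p) + p$.
Since $\log(1-p) \le -p$, the right-hand side is at most $p^2 \le p^2/(1-p)$, which yields the result. □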
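
Lemma 5.4 is easy to check numerically. The sketch below computes the compound Poisson mass function by the Katti-Panjer recursion and compares the relative entropy with the two quantities appearing in the proof, taking $J_{Q,1}(Y) = p^2/(1-p)$ as stated there. The choice of a truncated geometric $Q$, the values of $p$, and the function names are assumptions made only for illustration.

```python
import numpy as np

def compound_poisson_pmf(lam, Q, smax):
    """pmf of CPo(lam, Q) on {0, ..., smax} via the Katti-Panjer recursion.

    Q[j-1] holds Q(j) for j = 1, ..., len(Q).
    """
    Q = np.asarray(Q, dtype=float)
    C = np.zeros(smax + 1)
    C[0] = np.exp(-lam)
    for s in range(1, smax + 1):
        j = np.arange(1, min(s, len(Q)) + 1)
        # s * C(s) = lam * sum_j j Q(j) C(s - j)
        C[s] = (lam / s) * np.sum(j * Q[j - 1] * C[s - j])
    return C

def lemma_5_4_quantities(p, Q):
    """Return (D(P || CPo(p,Q)), (1-p)log(1-p) + p, p^2/(1-p)) for Y = BX."""
    Q = np.asarray(Q, dtype=float)
    P = np.concatenate(([1.0 - p], p * Q))     # law of Y on {0, ..., len(Q)}
    C = compound_poisson_pmf(p, Q, len(Q))
    mask = P > 0
    D = float(np.sum(P[mask] * np.log(P[mask] / C[mask])))   # in nats
    return D, (1.0 - p) * np.log(1.0 - p) + p, p**2 / (1.0 - p)

# Assumed example: geometric Q with alpha = 0.3, truncated at 60 and renormalised.
alpha = 0.3
Q = (1 - alpha) * alpha ** np.arange(60)
Q /= Q.sum()

results = {p: lemma_5_4_quantities(p, Q) for p in (0.05, 0.2, 0.5, 0.8)}
for p, (D, mid, J) in results.items():
    print(f"p={p}: D={D:.5f}, (1-p)log(1-p)+p={mid:.5f}, J={J:.5f}")
```

In each row the three printed numbers should be non-decreasing from left to right, matching the chain of inequalities in the proof.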



6 Comparison with existing bounds 

In this section, we compare the bounds obtained in our three main results, Theorems 1.1, 1.2
and 1.4, with inequalities derived by other methods. Throughout, $S_n = \sum_{i=1}^n Y_i = \sum_{i=1}^n B_i X_i$,
where the $B_i$ and the $X_i$ are independent sequences of independent random
variables, with $B_i \sim \mathrm{Bern}(p_i)$ for some $p_i \in (0,1)$, and with $X_i \sim Q_i$ on $\mathbb{N}$; we write
$\lambda = \sum_{i=1}^n p_i$.

There is a large body of literature developing bounds on the distance between the
distribution $P_{S_n}$ of $S_n$ and compound Poisson distributions; see, e.g., [13] and the references
therein, or [29, Section 2] for a concise review.

We begin with the case in which all the $Q_i = Q$ are identical, when, in view of a remark
of Le Cam [24, bottom of p. 187] and Michel [27], bounds computed for the case $X_i = 1$
a.s. for all $i$ are also valid for any $Q$. One of the earliest results is the following inequality
of Le Cam [23], building on earlier results by Khintchine and Doeblin,

$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le \sum_{i=1}^n p_i^2. \qquad (6.1)$$

Barbour and Hall (1984) used Stein's method to improve the bound to

$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i^2. \qquad (6.2)$$

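Roos [28] gives the asymptotically sharper bound

$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le \left( \frac{3}{4e} + \frac{7\sqrt{\theta}\,(3 - 2\sqrt{\theta})}{6(1-\sqrt{\theta})^2} \right) \theta, \qquad (6.3)$$

where $\theta = \lambda^{-1}\sum_{i=1}^n p_i^2$. In this setting, the bound (1.6) that was derived from Theorem 1.1
yields

$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le \left( \frac{1}{2\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i} \right)^{1/2}. \qquad (6.4)$$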
The bounds (6.2)-(6.4) are all derived using the observation made by Le Cam and
Michel, taking $Q$ to be degenerate at 1. For the application of Theorem 1.4, however, the
distribution $Q$ must have support the whole of $\mathbb{N}$, so $Q$ cannot be replaced by the point
mass at 1 in the formula; the bound that results from Theorem 1.4 can be expressed as

$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le H(\lambda,Q) \left( K(Q) \sum_{i=1}^n p_i^3 \right)^{1/2}, \quad \text{with} \quad K(Q) = \sum_y Q(y)\, y^2 \left( \frac{Q^{*2}(y)}{2Q(y)} - 1 \right)^2. \qquad (6.5)$$

An illustration of the effectiveness of these bounds with geometric $Q$ and equal $p_i$ is given in
Section 6.2.

For non-equal $Q_i$, the bounds are more complicated. We compare those given in
Theorems 1.2 and 1.4 with three other bounds. The first is Le Cam's bound (6.1), which
remains valid as stated in the case of non-equal $Q_i$. The second, from Stein's method,
has the form

$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le G(\lambda,Q) \sum_{i=1}^n q_i^2 p_i^2, \qquad (6.6)$$

see Barbour and Chryssaphinou [6, eq. (2.24)], where $q_i$ is the mean of $Q_i$ and $G(\lambda,Q)$ is
a Stein factor: if $jQ(j)$ is non-increasing, then

$$G(\lambda,Q) = \min\left\{ 1,\; \delta\left( \frac{\delta}{4} + \log^+\frac{2}{\delta} \right) \right\},$$

where $\delta = [\lambda\{Q(1) - 2Q(2)\}]^{-1} > 0$. The third is that of Roos [29], Theorem 2, which
is in detail very complicated, but correspondingly accurate. A simplified version, valid if
$jQ(j)$ is decreasing, gives

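$$d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q)) \le \frac{\alpha_2}{(1 - 2e\alpha_2)_+}, \qquad (6.7)$$

where

$$\alpha_2 = \sum_{i=1}^n g(2p_i)\, p_i^2 \min\left( \frac{q_i^2}{e\lambda},\; \frac{\nu_i}{2^{3/2}\lambda},\; 1 \right),$$

$\nu_i = \sum_{y \ge 1} Q_i(y)^2/Q(y)$ and $g(z) = 2z^{-2}e^{z}(e^{-z} - 1 + z)$. We illustrate the effectiveness of
these bounds in Section 6.3; in our examples, Roos's bounds are much the best.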



6.1 Broad comparisons 

Because of their apparent complexity and different forms, general comparisons between the 
bounds are not straightforward, so we consider two particular cases below in Sections 6.2 
and 6.3. However, the following simple observation on approximating compound binomials 
by a compound Poisson gives a first indication of the strength of one of our bounds. 
Proposition 6.1. For equal $p_i$ and equal $Q_i$:

1. If $n > (\sqrt{2}\,p(1-p))^{-1}$, then the bound of Theorem 1.1 is stronger than Le Cam's
bound (6.1);

2. If $p < 1/2$, then the bound of Theorem 1.1 is stronger than the bound (6.2);

3. If $0.012 < p < 1/2$ and $n > (\sqrt{2}\,p(1-p))^{-1}$ are satisfied, then the bound of
Theorem 1.1 is stronger than all three bounds in (6.1), (6.2) and (6.3).

Proof. The first two observations follow by simple algebra, upon noting that the bound of
Theorem 1.1 in this case reduces to $p/\sqrt{2(1-p)}$; the third is shown numerically, noting that
here $\theta = p$. □
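
The numerical part of Proposition 6.1 can be reproduced along the following lines. This Python sketch evaluates the four bounds in the equal-$p_i$ case, using the forms $np^2$ for (6.1), $\min\{1, 1/(np)\}\, np^2$ for (6.2), $\big(3/(4e) + 7\sqrt{\theta}(3-2\sqrt{\theta})/(6(1-\sqrt{\theta})^2)\big)\theta$ with $\theta = p$ for (6.3), and $p/\sqrt{2(1-p)}$ for Theorem 1.1. These closed forms are our reading of the displayed inequalities and should be treated as assumptions, as should the grid of $p$ values and the function names.

```python
import math

def le_cam(n, p):
    """Bound (6.1) for equal p_i: n p^2."""
    return n * p**2

def barbour_hall(n, p):
    """Bound (6.2) for equal p_i: min{1, 1/lambda} n p^2 with lambda = n p."""
    return min(1.0, 1.0 / (n * p)) * n * p**2

def roos(p):
    """Bound (6.3) for equal p_i, where theta = p (assumed form of the bound)."""
    t = math.sqrt(p)
    return (3 / (4 * math.e) + 7 * t * (3 - 2 * t) / (6 * (1 - t) ** 2)) * p

def theorem_1_1(p):
    """Bound (6.4) for equal p_i: p / sqrt(2 (1 - p))."""
    return p / math.sqrt(2 * (1 - p))

def compare(p):
    """Evaluate all bounds at the smallest integer n above (sqrt(2) p (1-p))^{-1}."""
    n = math.floor(1 / (math.sqrt(2) * p * (1 - p))) + 1
    return n, theorem_1_1(p), le_cam(n, p), barbour_hall(n, p), roos(p)

# Hypothetical grid of p values inside the range 0.012 < p < 1/2 of item 3.
grid = [0.012 + k * (0.49 - 0.012) / 40 for k in range(41)]
rows = [compare(p) for p in grid]
for p, (n, t11, lc, bh, rs) in zip(grid[::10], rows[::10]):
    print(f"p={p:.4f} n={n}: Thm 1.1={t11:.5f} LeCam={lc:.5f} BH={bh:.5f} Roos={rs:.5f}")
```

On this grid the Theorem 1.1 value should be the smallest entry in every row, which is the content of item 3.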



One can also examine the rate of convergence of the total variation distance between
the distribution $P_{S_n}$ and the corresponding compound Poisson distribution, under simple
asymptotic schemes. We think of situations in which the $p_i$ and $Q_i$ are not necessarily
equal, but are all in some reasonable sense comparable with one another; we shall also
suppose that $jQ(j)$ is more or less a fixed and decreasing sequence. Two ways in which $p$
varies with $n$ are considered:

Regime I. $p = \lambda/n$ for fixed $\lambda$, and $n \to \infty$;

Regime II. $p = \sqrt{\mu/n}$, so that $\lambda = \sqrt{\mu n} \to \infty$ as $n \to \infty$.

Under these conditions, the Stein factors $H(\lambda,Q)$ are of the same order as $1/\sqrt{np}$.
Table 1 compares the asymptotic performance of the various bounds above.



Bound                  $d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q))$ to leading order   I          II

Le Cam (6.1)           $np^2$                                          $n^{-1}$   $1$
Roos (6.7)             $np^2 \min(1, 1/(np))$                          $n^{-1}$   $n^{-1/2}$
Stein's method (6.6)   $np^2 \min(1, \log(np)/np)$                     $n^{-1}$   $n^{-1/2} \log n$
Theorem 1.2            $\sqrt{np}$                                     $1$        $n^{1/4}$
Theorem 1.4 (6.5)      $p$                                             $n^{-1}$   $n^{-1/2}$

Table 1: Comparison of the first-order asymptotic performance of the bounds in (6.1),
(6.6) and (6.7), with those of Theorems 1.2 and 1.4 for comparable but non-equal $Q_i$, in
the two limiting regimes $p \asymp 1/n$ and $p \asymp 1/\sqrt{n}$.

The poor behaviour of the bound in Theorem 1.2 shown above occurs because, for
large values of $\lambda$, the quantity $D(\mathbf{Q})$ behaves much like $\lambda$, unless the $Q_i$ are identical or
near-identical.
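
The rates in the two regimes can be read off mechanically from the leading-order expressions. The Python sketch below evaluates the expressions $np^2$, $np^2\min(1, 1/(np))$, $np^2\min(1, \log(np)/np)$ and $p$ at two values of $n$ and reports the empirical exponent of $n$. The constants $\lambda = 5$ in Regime I and $\mu = 0.5$ in Regime II, the two sample sizes, and the convention used when $np \le 1$ are assumptions chosen for illustration.

```python
import math

def leading_orders(n, p):
    """Leading-order expressions for four of the bounds, as functions of n and p."""
    lam = n * p
    # log(np)/np is only meaningful for np > 1; for np <= 1 the minimum is
    # taken to be 1 here (an assumption to keep the expression well defined).
    stein_min = min(1.0, math.log(lam) / lam) if lam > 1 else 1.0
    return {
        "Le Cam (6.1)": n * p**2,
        "Roos (6.7)": n * p**2 * min(1.0, 1.0 / lam),
        "Stein (6.6)": n * p**2 * stein_min,
        "Theorem 1.4 (6.5)": p,
    }

def empirical_exponents(p_of_n, n1=10**4, n2=10**8):
    """Slope of log(bound) against log(n) between n1 and n2."""
    a = leading_orders(n1, p_of_n(n1))
    b = leading_orders(n2, p_of_n(n2))
    return {k: math.log(b[k] / a[k]) / math.log(n2 / n1) for k in a}

regime_I = empirical_exponents(lambda n: 5.0 / n)               # p = lambda / n
regime_II = empirical_exponents(lambda n: math.sqrt(0.5 / n))   # p = sqrt(mu / n)

for name in regime_I:
    print(f"{name:20s} I: {regime_I[name]:+.3f}   II: {regime_II[name]:+.3f}")
```

The logarithmic factor in the Stein's method expression means that its empirical exponent in Regime II lies strictly between $-1/2$ and $0$ at any finite $n$, approaching $-1/2$ only slowly.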
6.2 Example. Compound binomial with equal geometrics

We now examine the finite-$n$ behavior of the approximation bounds (6.1)-(6.3) in the
particular case of equal $p_i$ and equal $Q_i$, when $Q_i$ is geometric with parameter $\alpha > 0$,
$Q(j) = (1-\alpha)\alpha^{j-1}$, $j \ge 1$.

If $\alpha < \frac{1}{2}$, then $\{jQ(j)\}$ is decreasing and, with $\delta = [\lambda(1 - 3\alpha + 2\alpha^2)]^{-1}$, the Stein factor
in (6.5) becomes

$$H(\lambda,Q) = \min\{1, \sqrt{\delta}\,(2 - \sqrt{\delta})\}.$$

The resulting bounds are plotted in Figures 1-3.
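
For readers who wish to reproduce a single point of such a comparison, the following Python sketch computes the exact total variation distance between the compound binomial and compound Poisson laws for geometric $Q$, together with four of the bounds. It assumes the forms $np^2$ for (6.1), $\min\{1,1/\lambda\}\,np^2$ for (6.2), $p/\sqrt{2(1-p)}$ for (6.4), and $H(\lambda,Q)\,(K(Q)\, n p^3)^{1/2}$ with $K(Q) = \sum_y Q(y) y^2 (Q^{*2}(y)/(2Q(y)) - 1)^2$ for (6.5); for geometric $Q$ one has $Q^{*2}(y)/(2Q(y)) = (y-1)(1-\alpha)/(2\alpha)$. The truncation point `smax`, the parameter values and all function names are illustrative assumptions.

```python
import numpy as np

def compound_poisson_pmf(lam, Q, smax):
    """pmf of CPo(lam, Q) on {0, ..., smax} via the Katti-Panjer recursion."""
    C = np.zeros(smax + 1)
    C[0] = np.exp(-lam)
    for s in range(1, smax + 1):
        j = np.arange(1, min(s, len(Q)) + 1)
        C[s] = (lam / s) * np.sum(j * Q[j - 1] * C[s - j])
    return C

def compound_binomial_pmf(n, p, Q, smax):
    """pmf of a sum of n i.i.d. copies of Y = BX, truncated to {0, ..., smax}."""
    Y = np.concatenate(([1.0 - p], p * Q))[: smax + 1]
    P = np.zeros(smax + 1)
    P[0] = 1.0
    for _ in range(n):
        P = np.convolve(P, Y)[: smax + 1]
    return P

def geometric_example(n, lam, alpha, smax=200):
    p = lam / n
    y = np.arange(1, smax + 1)
    Q = (1 - alpha) * alpha ** (y - 1)          # Q(j) = (1 - alpha) alpha^(j-1)
    P = compound_binomial_pmf(n, p, Q, smax)
    C = compound_poisson_pmf(lam, Q, smax)
    tv = 0.5 * np.abs(P - C).sum()              # mass beyond smax is negligible here

    # Stein factor for alpha < 1/2, and K(Q) using the geometric closed form.
    delta = 1.0 / (lam * (1 - 3 * alpha + 2 * alpha**2))
    H = min(1.0, np.sqrt(delta) * (2 - np.sqrt(delta)))
    ratio = (y - 1) * (1 - alpha) / (2 * alpha)  # Q^{*2}(y) / (2 Q(y))
    K = np.sum(Q * y**2 * (ratio - 1) ** 2)

    bounds = {
        "Le Cam (6.1)": n * p**2,
        "Barbour-Hall (6.2)": min(1.0, 1.0 / lam) * n * p**2,
        "Theorem 1.1 (6.4)": p / np.sqrt(2 * (1 - p)),
        "Theorem 1.4 (6.5)": H * np.sqrt(K * n * p**3),
    }
    return tv, bounds, P, C

tv, bounds, P, C = geometric_example(n=100, lam=5.0, alpha=0.2)
print(f"true total variation distance: {tv:.5f}")
for name, value in bounds.items():
    print(f"{name:20s} {value:.5f}")
```

Varying `alpha`, `n` or `lam` in the final call traces out curves of the kind described in the figure captions, within the range $\alpha < 1/2$ where the Stein factor formula applies.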




[Plot omitted; horizontal axis: alpha]

Figure 1: Bounds on the total variation distance $d_{TV}(C_Q\mathrm{Bin}(p,Q), \mathrm{CPo}(\lambda,Q))$ for $Q \sim \mathrm{Geom}(\alpha)$,
plotted against the parameter $\alpha$, with $n = 100$ and $\lambda = 5$ fixed. The values
of the bound in (6.4) are plotted as ○; those in (6.5) as △; those of the Stein's method
bound in (6.2) as ▽; and Roos' bounds in (6.3) and (6.7) as × and ⊠, respectively. The
true total variation distances, computed numerically in each case, are plotted as ●.



6.3 Example. Sums with unequal geometrics

Here, we consider finite-$n$ behavior of the approximation bounds (6.1), (6.6) and (6.7) in
the particular case when the distributions $Q_i$ are geometric with parameters $\alpha_i > 0$. The
resulting bounds are plotted in Figures 4 and 5.

In this case, it is clear that the best bounds by a considerable margin are those of
Roos [29] given in (6.7).
[Plot omitted; horizontal axis: n]

Figure 2: Bounds on the total variation distance $d_{TV}(C_Q\mathrm{Bin}(p,Q), \mathrm{CPo}(\lambda,Q))$ for $Q \sim \mathrm{Geom}(\alpha)$
as in Figure 1, here plotted against the parameter $\alpha$, with $n = 100$ and $\lambda = 5$
fixed.

Figure 3: Bounds on the total variation distance $d_{TV}(C_Q\mathrm{Bin}(p,Q), \mathrm{CPo}(\lambda,Q))$ for $Q \sim \mathrm{Geom}(\alpha)$
as in Figure 1, here plotted against the parameter $\lambda$, with $\alpha = 0.2$ and $n = 100$
fixed.

Figure 4: Bounds on the total variation distance $d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q))$ for $Q_i \sim \mathrm{Geom}(\alpha_i)$,
where the $\alpha_i$ are uniformly spread between 0.15 and 0.25, $n$ varies, and $p$ is as in Regime I,
$p = 5/n$. Again, bounds based on $J_{Q,1}$ are plotted as ○; those based on $J_{Q,2}$ as △; Le
Cam's bound in (6.1) as ▽; the Stein's method bound in (6.6) as ×, and Roos' bound
from Theorem 2 of [29] as ⊠. The true total variation distances, computed numerically in
each case, are plotted as ●.

Figure 5: Bounds on the total variation distance $d_{TV}(P_{S_n}, \mathrm{CPo}(\lambda,Q))$ for $Q_i \sim \mathrm{Geom}(\alpha_i)$
as in Figure 4, where the $\alpha_i$ are uniformly spread between 0.15 and 0.25, $n$ varies, and $p$ is as
in Regime II, $p = \sqrt{0.5/n}$.